In this notebook we evaluate the Alnus pollen forecast with a Deep Learning model. The notebook is an adjusted version of the one used for this paper, and might still contain some legacy code and comments as a consequence. The ML model is compared to three other timeseries during the same period:

Verification of the model happens exclusively at the station level so far, as no spatial verification method has been widely accepted in the pollen community. Pollen surface level concentrations (i.e. level k=80 in Cosmo) are evaluated.

The Data

The observations are provided at 9-14 different stations (depending on the season), whereof one is excluded from the analysis: Davos is high up in the mountains and pollen measurements are almost always zero.

The following settings are crucial and should always be remembered when running the chunks below:

Observations of low pollen concentrations are uncertain. Therefore, times- tamps with modelled or measured values < 10 m-3 are removed from all three data sets for the numerical analysis. This symmetrical exclusion is necessary to make sure that no artificial biases are introduced. For the analyses carried out for specific health impact based categories (i.e. concentration bins), timestamps with measured values < 10 m-3 are removed only (i.e. modelled values are allowed to be < 10 m-3. This asymmetrical exclusion is justified for the investigation of relative changes in categorical metrics. In the time series plot (Figure 1) and the boxplot (Figure 5) values < 10 are not removed at all for better visibility. Whether low values <10 were removed are not will be denoted in all figure captions and should allow the reader to get a full picture of the data.

Residual Analysis

First we compare the three data sets with the traditional ANOVA approach. Statistical inference (p-values, confidence intervals, . . . ) is only valid if the model assumptions are fulfilled. So far, this means (many paragraphs are quoted from Lukas Meier ETH - Script Applied Statistics ANOVA Course):

The first assumption is most crucial (but also most difficult to check). If the independence assumption is violated, statistical inference can be very inaccurate. In the ANOVA setting, the last assumption is typically not as important compared to a regression setting, as we are typically fitting “large” models.

Are the errors normally distributed?

In a QQ-plot we plot the empirical quantiles (“what we see in the data”) vs. the theoretical quantiles (“what we expect from the model”). The plot should show a more or less straight line if the distributional assumption is correct. By default, a standard normal distribution is the theoretical “reference distribution”.

They are definitely not and we have to do some adjustments. So for the following plot we logarithmic the data to deal with the right-skewedness. The best results were achieved by first logarithmic the data and then taking the square root.

Do the errors have mean zero? & Is the error variance constant?

The Tukey-Anscombe plot plots the residuals vs. the fitted values. It allows us to check whether the residuals have constant variance and whether the residuals have mean zero (i.e. they don’t show any deterministic pattern). We don’t plot the smoothing line as loess (and other) algorithms have issues when the same value is repeated a large number of times (jitter did not really help).

Are the errors independent?

If the data has some serial structure (i.e., if observations were recorded in a certain time order), we typically want to check whether residuals close in time are more similar than residuals far apart, as this would be a violation of the independence assumption. We can do so by using a so-called index plot where we plot the residuals against time. For positively dependent residuals we would see time periods where most residuals have the same sign, while for negatively dependent residuals, the residuals would “jump” too often from positive to negative compared to independent residuals.

Summary: Residual Analysis shows that the assumptions of “normal” statiscal methods are validated even for log(daily values). It is therefore suggested to continue the analysis with robust and simple metrics.

Visual Assessment

Basic Plots

General overview of the daily concentration values as represented in the four timeseries. The separate lines represent concentrations at individual stations as measured or modelled in the year 2022.

In this plot the reader can compare the Alnus pollen season at each station individually.

The overall distribution of the concentration values is shown in the following: First, as a boxplot and second as a histogram.

Correlation Plots

The correlation between the Model and Measurements can be calculated easily and then the CI and p-values must be adjusted for multiple comparison. The corr-test function from the psych handily offers this functionality.

Careful the correlation coefficients method have some serious shortcomings:

The correlation coefficient measures linear agreement–whether the measurements go up-and-down together. Certainly, we want the measures to go up-and-down together, but the correlation coefficient itself is deficient in at least three ways as a measure of agreement. (http://www.jerrydallal.com/LHSP/compare.htm)

  • The correlation coefficient can be close to 1 (or equal to 1!) even when there is considerable bias between the two methods. For example, if one method gives measurements that are always 10 units higher than the other method, the correlation will be 1 exactly, but the measurements will always be 10 units apart.
  • The magnitude of the correlation coefficient is affected by the range of subjects/units studied. The correlation coefficient can be made smaller by measuring samples that are similar to each other and larger by measuring samples that are very different from each other. The magnitude of the correlation says nothing about the magnitude of the differences between the paired measurements which, when you get right down to it, is all that really matters.
  • The usual significance test involving a correlation coefficient– whether the population value is 0–is irrelevant to the comparability problem. What is important is not merely that the correlation coefficient be different from 0. Rather, it should be close to (ideally, equal to) 1!

A good summary of the methods and their shortcomings can be found here: https://www.statisticssolutions.com/correlation-Pearson-Kendall-spearman/

Altman Bland Plots

The well established AB-method for clinical trials can be used here as well to compare the means and differences between datasets. If the points lie within the two SD-line for the differences the datasets can be assumed to be strongly associated with each other.

Density Plots

These plots allow to observe the error for different concentration categories.

Statistical Assessment

First, various metrics are compared where the pollen concentrations are considered a continuous numerical variable.

Alnus
R2 Bias SD MAE RMSE MSLE RMSLE type
0.1740684 -33.07869 168.5172 88.53139 171.4343 0.2208793 0.4699779 cosmo_input
0.3134525 23.30544 141.2243 80.35080 142.8827 0.1605567 0.4006953 cosmo_target
0.1384928 -31.93171 156.1449 79.17168 159.1001 0.1868507 0.4322623 ml

Second, the values will be converted into health impact based buckets. The impact classes have been defplotlistined https://service.meteoswiss.ch/confluence/x/1ZG4 Now we can investigate various metrics that are typically used for categoric variables. The Kappa metric is explained here and was chosen as the most meaningful metric for this analysis: https://towardsdatascience.com/multi-class-metrics-made-simple-the-kappa-score-aka-cohens-kappa-coefficient-bdea137af09c

Alnus
type Accuracy Kappa
cosmo_input 0.5416667 0.1607799
cosmo_target 0.6138889 0.3014685
ml 0.5722222 0.1548136

To takes into account that the health impact levels are ordered. We can treat them as numerical values from 0:nothing - 4: very strong.

Alnus
MAE_cosmo_input MAE_cosmo_target MAE_ml
0.5333333 0.4111111 0.4805556

The following table could be used in the appendix.

Reference Event No Event Predicted Event A B No Event C D The formulas used here are:

Alnus
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy class type
NA 0.825 NA NA 0.000 NA NA 0.000 0.000 0.175 NA Class: weak cory_alnu
NA 0.933 NA NA 0.000 NA NA 0.000 0.000 0.067 NA Class: weak cosmo_target
NA 0.922 NA NA 0.000 NA NA 0.000 0.000 0.078 NA Class: weak ml
0.617 0.576 0.793 0.363 0.793 0.617 0.694 0.725 0.447 0.564 0.596 Class: medium cory_alnu
0.648 0.788 0.889 0.459 0.889 0.648 0.749 0.725 0.469 0.528 0.718 Class: medium cosmo_target
0.663 0.566 0.801 0.389 0.801 0.663 0.725 0.725 0.481 0.600 0.614 Class: medium ml
0.471 0.839 0.405 0.872 0.405 0.471 0.435 0.189 0.089 0.219 0.655 Class: strong cory_alnu
0.485 0.757 0.317 0.863 0.317 0.485 0.384 0.189 0.092 0.289 0.621 Class: strong cosmo_target
0.471 0.736 0.294 0.857 0.294 0.471 0.362 0.189 0.089 0.303 0.603 Class: strong ml
0.065 0.960 0.133 0.916 0.133 0.065 0.087 0.086 0.006 0.042 0.513 Class: verystrong cory_alnu
0.613 0.930 0.452 0.962 0.452 0.613 0.521 0.086 0.053 0.117 0.771 Class: verystrong cosmo_target
0.032 0.982 0.143 0.915 0.143 0.032 0.053 0.086 0.003 0.019 0.507 Class: verystrong ml

Robust Contrasts with Confidence Intervals

https://www.researchgate.net/publication/282206980_nparcomp_An_R_Software_Package_for_Nonparametric_Multiple_Comparisons_and_Simultaneous_Confidence_Intervals The R package nparcomp implements a broad range of rank-based nonparametric methods for multiple comparisons. The single step procedures provide local test decisions in terms of multiplicity adjusted p-values and simultaneous confidence intervals. The null hypothesis H0: p = 1/2 is significantly rejected at 5% level of significance for many pairwise comparisons. Whenever the p-Value is < than 5% = the confidence interval contains 0.5 -> the effect from the factor trap is not statistically meaningful. The Estimator can also be interpreted as a proxy for the relative difference in median between Model and Measurements. If the Estimator is > 0.5 then the model tends to have larger measurements.

##
##  #------Nonparametric Multiple Comparisons for relative contrast effects-----#
##
##  - Alternative Hypothesis:  True relative contrast effect p is not equal to 1/2
##  - Type of Contrast : Dunnett
##  - Confidence level: 95 %
##  - Method = Logit - Transformation
##  - Estimation Method: Pairwise rankings
##
##  #---------------------------Interpretation----------------------------------#
##  p(a,b) > 1/2 : b tends to be larger than a
##  #---------------------------------------------------------------------------#
## 
Robust Contrasts and Confidence Intervals for Alnus Pollen Measurements
Taxon Comparison Estimator Lower Upper pValue
Alnus p( measurement , cosmo_input ) 0.446 0.390 0.504 0.075977235171957
Alnus p( measurement , cosmo_target ) 0.564 0.506 0.620 0.0262266449055567
Alnus p( measurement , ml ) 0.532 0.474 0.589 0.424212503683236

FINAL DECISION:

name cosmo_target ml change
Accuracy 0.61 0.57 -6.79%
Kappa 0.30 0.15 -48.65%
MAE_cat 0.41 0.48 16.89%
Bias 23.31 -31.93 37.01%
SD 141.22 156.14 10.57%
MAE 80.35 79.17 -1.47%
RMSLE 0.40 0.43 7.88%

)